Enhancing Urban Tree Planting through Air Quality Analysis

Authored by: Vamshi Krishna Y R
Duration: 90 mins
Level: Intermediate
Pre-requisite Skills: Python, Pandas, Matplotlib, Folium, Geopandas, Plotly, Scikit-learn

Use case scenario

As a: Resident of Melbourne who values sustainable city living and concerned about air pollution and its impact on daily health and wellbeing.

I want to: See key roadside areas in the city selected for tree planting projects based on their potential to reduce airborne pollutants.

So that I can: Breathe cleaner air in urban spaces, benefiting from reduced exposure to traffic-related emissions and a healthier living environment.

By:

  • Analysing microclimate sensor data to identify areas along Melbourne’s roadways with poor air quality and high exposure to vehicular emissions.

  • Mapping these pollution-prone zones against existing tree canopy data to locate streets with minimal natural filtration and shade coverage.

  • Integrating tree planting zone data to pinpoint feasible locations for vegetation expansion that align with municipal greening regulations and infrastructure constraints.

  • Selecting tree species with high pollutant absorption capabilities and modelling their expected impact on local air quality over time.

  • Applying geospatial analysis to overlay environmental indicators—such as pollutant levels, canopy gaps, and permissible planting areas—to inform strategic decision-making.

What this use case will teach you

  • This use case explores how to integrate diverse urban datasets—specifically microclimate sensor readings, tree canopy coverage, and planting zone data—to identify roadside locations where strategic tree planting can significantly improve air quality. By engaging with real-world environmental data, you will address pressing urban health challenges related to pollution exposure.

  • You will gain practical experience in applying geospatial analysis and leveraging microclimate insights to support data-informed decisions in sustainable urban planning.

  • The case study focuses on techniques for assessing how vegetation coverage, pollutant concentrations, and permissible planting zones intersect, equipping you with the tools to prioritise greening interventions that deliver measurable air quality benefits.

  • A key learning outcome will be the development of user-friendly dashboards and interactive visualisations to effectively communicate findings to stakeholders such as urban planners, environmental agencies, and community members.

  • You will also learn to evaluate the environmental and societal implications of targeted greening projects, enabling you to propose data-driven strategies for creating healthier, more breathable urban environments.

Project Goals and expected outcomes

  • This project showcases the capability to integrate and analyse multiple open datasets—specifically microclimate sensor data, existing tree canopy coverage, and designated planting zones—to identify roadside locations where new tree plantings can most effectively improve urban air quality.

  • The analysis will involve spatial and environmental evaluations to prioritise streets and road corridors that currently experience high levels of air pollution and lack sufficient vegetation. These assessments will guide decisions around optimal placement and species selection to maximise air filtration benefits.

  • The goal is to deliver actionable insights that contribute to cleaner urban air by enhancing tree coverage in strategically selected areas. The project supports Melbourne’s broader objectives for sustainability, public health, and environmental resilience.

  • A central output will be the development of an interactive, user-focused dashboard that visualises pollutant levels, canopy gaps, and suitable planting zones. This tool will empower urban planners, environmental stakeholders, and policymakers to make data-informed decisions.

  • Ultimately, the project aims to provide clear, evidence-based recommendations for improving air quality through targeted urban greening—enhancing both the environmental performance of city streets and the everyday wellbeing of Melbourne’s residents.

Initialisation¶

Importing the necessary libraries¶

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd       # Data manipulation and analysis
import numpy as np        # Numerical operations
from sklearn.impute import SimpleImputer #For handling missing values
import matplotlib.pyplot as plt  # Basic plotting
import plotly.express as px # For complex plotting
import contextily as ctx # for geo spatial plotting
import seaborn as sns     # Advanced statistical data visualization
import geopandas as gpd  # For handling shapefiles, GeoJSON, etc.
import folium           # For interactive mapping
import requests         #For accessing datasets
from io import StringIO #For reading and writing with in a string in memory
from collections import defaultdict #For creating dictionary
import json  #For parsing and reading json files 
from IPython.display import IFrame #display saved visualisations
from shapely.geometry import shape, Point  # For creating and manipulating Geometric Objects
%matplotlib inline

Importing the data through API from open data portal of Melbourne¶

The below function accesses open datasets via API endpoints, enabling users to obtain information in CSV format suitable for in-depth analysis. By providing the dataset identifier and a valid API key, it issues a request to the Melbourne data portal and interprets the response to retrieve pertinent data. This method streamlines the incorporation of diverse datasets—such as microclimate sensor, urban tree canopies, and tree planting zone, facilitating straightforward access and efficient data integration for applications in urban planning research.

In [2]:
def import_data(datasetname): # pass in dataset name and api key

    """
    Imports a dataset from the City of Melbourne Open Data API.

    Parameters:
    - dataset_id (str): The unique dataset identifier.
    Returns:
    - pd.DataFrame: The imported dataset as a pandas DataFrame.
    """

    dataset_id = datasetname

    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    apikey = api_key
    dataset_id = dataset_id
    format = 'csv'

    url = f'{base_url}{dataset_id}/exports/{format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC'
    }

    # GET request
    response = requests.get(url, params=params)

    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        datasetname = pd.read_csv(StringIO(url_content), delimiter=';')
        print(f' Imported the {dataset_id} dataset with {len(datasetname)} records succesfully \n')
        return datasetname 
    else:
        return (print(f'Request failed with status code {response.status_code}'))

Datasets¶

The following section illustrates how data is retrieved and prepared for analysis using multiple datasets. The import_data function is employed to access three key datasets: tree canopies, microclimate sensors, and tree planting zones. Each dataset is stored in separate pandas dataframe — tree_canopy_2021, microclimate, and tree_planting. The number of records retrieved from each dataset is displayed, providing a clear overview of the data volume available for exploration. This structured approach ensures that the datasets are properly organized and readily available for integration into analytical workflows within the project.

Importing dataset - tree-canopies-2021-urban-forest

About the dataset: The Tree Canopies 2021 - Urban Forest dataset maps the extent of tree canopy cover across the City of Melbourne using aerial imagery and LiDAR data. It provides detailed spatial insights into urban forest coverage, supporting initiatives in climate resilience, biodiversity, and urban planning.

In [3]:
tree_canopy_2021 = 'tree-canopies-2021-urban-forest' 
df_tree_canopy_2021 = import_data(tree_canopy_2021)
df_tree_canopy_2021_orig = df_tree_canopy_2021 #saving the original dataset
print('First few rows of the dataset:\n',df_tree_canopy_2021.head(5))
 Imported the tree-canopies-2021-urban-forest dataset with 57980 records succesfully 

First few rows of the dataset:
                               geo_point_2d  \
0   -37.77506304683423, 144.93898465421296   
1  -37.775132956993566, 144.93979253397976   
2   -37.77515941768105, 144.93986878938023   
3   -37.775130649990984, 144.9389679196713   
4   -37.77518700935825, 144.93875479828375   

                                           geo_shape  
0  {"coordinates": [[[[144.9389624164712, -37.775...  
1  {"coordinates": [[[[144.93978541786475, -37.77...  
2  {"coordinates": [[[[144.93986368038625, -37.77...  
3  {"coordinates": [[[[144.93894119058956, -37.77...  
4  {"coordinates": [[[[144.93877858274755, -37.77...  

Importing dataset - microclimate-sensors-data

About the dataset: The Microclimate Sensors Data dataset captures real-time environmental conditions across Melbourne using a network of sensors. It includes measurements such as air temperature, humidity, and solar radiation, supporting research in urban heat islands, climate monitoring, and sustainable city planning.

In [4]:
microclimate = 'microclimate-sensors-data' 
df_microclimate = import_data(microclimate)
df_microclimate_orig = df_microclimate #saving the original dataset
print('First few rows of the dataset:\n',df_microclimate.head(5))
 Imported the microclimate-sensors-data dataset with 292472 records succesfully 

First few rows of the dataset:
             device_id                received_at  \
0  ICTMicroclimate-08  2025-02-09T00:54:37+00:00   
1  ICTMicroclimate-11  2025-02-09T01:02:11+00:00   
2  ICTMicroclimate-05  2025-02-09T01:03:24+00:00   
3  ICTMicroclimate-01  2025-02-09T01:02:43+00:00   
4  ICTMicroclimate-09  2025-02-09T01:17:37+00:00   

                                      sensorlocation  \
0  Swanston St - Tram Stop 13 adjacent Federation...   
1                                   1 Treasury Place   
2                 Enterprize Park - Pole ID: COM1667   
3                    Birrarung Marr Park - Pole 1131   
4  SkyFarm (Jeff's Shed). Rooftop - Melbourne Con...   

                    latlong  minimumwinddirection  averagewinddirection  \
0  -37.8184515, 144.9678474                   0.0                 153.0   
1   -37.812888, 144.9750857                   0.0                 144.0   
2  -37.8204083, 144.9591192                   0.0                  45.0   
3  -37.8185931, 144.9716404                   NaN                 150.0   
4  -37.8223306, 144.9521696                   0.0                 241.0   

   maximumwinddirection  minimumwindspeed  averagewindspeed  gustwindspeed  \
0                 358.0               0.0               3.9            7.9   
1                 356.0               0.0               2.0            7.8   
2                 133.0               0.0               1.5            2.7   
3                   NaN               NaN               1.6            NaN   
4                 359.0               0.0               0.9            4.4   

   airtemperature  relativehumidity  atmosphericpressure  pm25  pm10  \
0            23.9         57.300000               1009.7   0.0   0.0   
1            24.5         56.200000               1005.3   0.0   0.0   
2            25.0         60.000000               1009.6   1.0   3.0   
3            23.1         61.099998               1009.0   0.0   5.0   
4            25.6         53.700000               1007.9   0.0   0.0   

       noise  
0  80.500000  
1  62.900000  
2  68.500000  
3  51.700001  
4  60.200000  

Importing dataset - microclimate-sensors-data

About the dataset: The Tree Planting Zone Schedules with Years - Urban Forest dataset outlines planned and completed tree planting activities across various city zones from 2017 onward. It provides scheduling information to support long-term urban greening strategies and enhance canopy coverage in the City of Melbourne.

In [5]:
tree_planting = 'tree-planting-zone-schedules-with-years-urban-forest'
df_tree_planting = import_data(tree_planting)
df_tree_planting_orig = df_tree_planting #saving the original dataset
print('First few rows of the dataset:\n',df_tree_planting.head(5))
 Imported the tree-planting-zone-schedules-with-years-urban-forest dataset with 839 records succesfully 

First few rows of the dataset:
                               geo_point_2d  \
0    -37.8030612625994, 144.96879535330316   
1    -37.81979235869452, 144.9675112730786   
2  -37.796602429685905, 144.96990674715127   
3    -37.79774465832566, 144.9502620746376   
4    -37.81652483951976, 144.9864951838192   

                                           geo_shape  str_from  segpart  \
0  {"coordinates": [[[[144.969387175266, -37.8029...       NaN      NaN   
1  {"coordinates": [[[[144.96828098035, -37.81969...       NaN      NaN   
2  {"coordinates": [[[[144.969341164027, -37.7965...       NaN      NaN   
3  {"coordinates": [[[[144.950283591209, -37.7975...       NaN      NaN   
4  {"coordinates": [[[[144.986101797552, -37.8163...       NaN      NaN   

   statusid  segid  streetname  streetid                         schedule  \
0       NaN  21556         NaN       NaN  Not determined by precinct plan   
1       NaN  22067         NaN       NaN                     Years 8 - 10   
2       NaN  20697         NaN       NaN                      Years 5 - 7   
3       NaN  21195         NaN       NaN                         Complete   
4       NaN  21945         NaN       NaN  Not determined by precinct plan   

   mapstatus  str_to                                           segdescr  
0        NaN     NaN  Pelham Street between Rathdowne Street and Dru...  
1        NaN     NaN  Riverside Avenue between St Kilda Road and Sou...  
2        NaN     NaN  Little Palmerston Street between Rathdowne Str...  
3        NaN     NaN  Chapman Street between Errol Street and Harker...  
4        NaN     NaN  Wellington Parade between Simpson Street and P...  

Data Cleansing and Preprocessing¶

The Data Cleansing and Preprocessing phase focuses on preparing the tree canopies, microclimate sensors, and tree planting zones datasets for analysis. This involves resolving inconsistencies, handling missing entries, and reformatting data as needed—such as separating latitude and longitude fields, removing redundant columns, and ensuring appropriate structure across datasets. These steps are critical to harmonize the datasets for seamless integration and analysis. By standardizing and validating the data, this process enhances the accuracy and reliability of any insights derived.

In [6]:
def split_geo_coordinates(df, geo_column):
    """
    Splits a combined latitude,longitude column into two separate float columns: 'latitude' and 'longitude'.
    
    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the geo column.
    - geo_column (str): The name of the column with 'latitude,longitude' string values.

    Returns:
    - pd.DataFrame: A new DataFrame with separate 'latitude' and 'longitude' columns.
    """
    if geo_column not in df.columns:
        raise ValueError(f"Column '{geo_column}' not found in DataFrame.")

    try:
        # Ensure the geo_column is of string type
        df[geo_column] = df[geo_column].astype(str)

        # Attempt to split the column
        split_data = df[geo_column].str.split(',', expand=True)

        if split_data.shape[1] != 2:
            raise ValueError(f"Column '{geo_column}' does not contain valid 'latitude,longitude' format.")

        df['latitude'] = pd.to_numeric(split_data[0], errors='coerce')
        df['longitude'] = pd.to_numeric(split_data[1], errors='coerce')

        # Drop rows with invalid coordinates
        df.dropna(subset=['latitude', 'longitude'], inplace=True)

        # Drop the original geo column
        df = df.drop(columns=[geo_column])

        print('Dataset Info after Geo Split:\n')
        print(df.info())

    except Exception as e:
        print(f"An error occurred during geolocation splitting: {e}")
        raise

    return df
In [7]:
def check_preprocess_dataset(df_dataset, dataset_name='dataset'):
    """
    Inspects and preprocesses a dataset:
    - Prints dataset info
    - Checks for missing values
    - Removes duplicate rows (if any)

    Parameters:
    - df_dataset (pd.DataFrame): The input DataFrame to be checked and cleaned.
    - dataset_name (str): Optional name of the dataset for logging purposes.

    Returns:
    - pd.DataFrame: A cleaned version of the input DataFrame.
    """
    try:
        if not isinstance(df_dataset, pd.DataFrame):
            raise TypeError("Input is not a pandas DataFrame.")

        print(f'Dataset Information for "{dataset_name}":\n')
        print(df_dataset.info())

        # Check for missing values
        print(f'\nMissing values in "{dataset_name}" dataset:\n')
        print(df_dataset.isnull().sum())

        # Identify and remove duplicates
        dupes = df_dataset.duplicated().sum()
        if dupes > 0:
            df_dataset = df_dataset.drop_duplicates()
            print(f'\nDeleted {dupes} duplicate record(s) from "{dataset_name}".')
        else:
            print(f'\nNo duplicate records found in "{dataset_name}".')

    except Exception as e:
        print(f"An error occurred while preprocessing '{dataset_name}': {e}")
        raise

    return df_dataset

Tree Canopies 2021 dataset

In [8]:
df_tree_canopy_2021 = check_preprocess_dataset(df_tree_canopy_2021, 'Tree Canopies 2021')
Dataset Information for "Tree Canopies 2021":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57980 entries, 0 to 57979
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   geo_point_2d  57980 non-null  object
 1   geo_shape     57980 non-null  object
dtypes: object(2)
memory usage: 906.1+ KB
None

Missing values in "Tree Canopies 2021" dataset:

geo_point_2d    0
geo_shape       0
dtype: int64

No duplicate records found in "Tree Canopies 2021".

To facilitate spatial analysis, the geo_point_2d column was split into separate latitude and longitude columns. These new columns were then converted into numeric formats to allow for further computations and visualizations. Finally, the original geo_point_2d column was dropped to avoid redundancy, leaving a clean and structured dataset ready for spatial analysis and modeling.

In [9]:
#splitting geo coordinates
df_tree_canopy_2021 = split_geo_coordinates(df_tree_canopy_2021,'geo_point_2d')
print('First few rows of the dataset after preprocessing:\n',df_tree_canopy_2021.head(5))
Dataset Info after Geo Split:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57980 entries, 0 to 57979
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   geo_shape  57980 non-null  object 
 1   latitude   57980 non-null  float64
 2   longitude  57980 non-null  float64
dtypes: float64(2), object(1)
memory usage: 1.3+ MB
None
First few rows of the dataset after preprocessing:
                                            geo_shape   latitude   longitude
0  {"coordinates": [[[[144.9389624164712, -37.775... -37.775063  144.938985
1  {"coordinates": [[[[144.93978541786475, -37.77... -37.775133  144.939793
2  {"coordinates": [[[[144.93986368038625, -37.77... -37.775159  144.939869
3  {"coordinates": [[[[144.93894119058956, -37.77... -37.775131  144.938968
4  {"coordinates": [[[[144.93877858274755, -37.77... -37.775187  144.938755

Microclimate Sensor dataset

In [10]:
#df_microclimate = df_microclimate_orig
df_microclimate = check_preprocess_dataset(df_microclimate, 'Microclimate Sensor')
Dataset Information for "Microclimate Sensor":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292472 entries, 0 to 292471
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   device_id             292472 non-null  object 
 1   received_at           292472 non-null  object 
 2   sensorlocation        286329 non-null  object 
 3   latlong               280989 non-null  object 
 4   minimumwinddirection  262620 non-null  float64
 5   averagewinddirection  292002 non-null  float64
 6   maximumwinddirection  262462 non-null  float64
 7   minimumwindspeed      262462 non-null  float64
 8   averagewindspeed      292002 non-null  float64
 9   gustwindspeed         262462 non-null  float64
 10  airtemperature        292002 non-null  float64
 11  relativehumidity      292002 non-null  float64
 12  atmosphericpressure   292002 non-null  float64
 13  pm25                  279254 non-null  float64
 14  pm10                  279254 non-null  float64
 15  noise                 279254 non-null  float64
dtypes: float64(12), object(4)
memory usage: 35.7+ MB
None

Missing values in "Microclimate Sensor" dataset:

device_id                   0
received_at                 0
sensorlocation           6143
latlong                 11483
minimumwinddirection    29852
averagewinddirection      470
maximumwinddirection    30010
minimumwindspeed        30010
averagewindspeed          470
gustwindspeed           30010
airtemperature            470
relativehumidity          470
atmosphericpressure       470
pm25                    13218
pm10                    13218
noise                   13218
dtype: int64

No duplicate records found in "Microclimate Sensor".
In [11]:
#selecting relevant columns
df_microclimate = df_microclimate[['device_id', 'received_at', 
                                   'sensorlocation', 'latlong','airtemperature', 
                                   'relativehumidity',  'pm25', 'pm10', 'noise']
]
In [12]:
print("\nMissing Values:")
print(df_microclimate.isna().sum())
Missing Values:
device_id               0
received_at             0
sensorlocation       6143
latlong             11483
airtemperature        470
relativehumidity      470
pm25                13218
pm10                13218
noise               13218
dtype: int64

Handling Missing Values in sensorlocation and latlong Columns:

To ensure the reliability and integrity of the microclimate sensor dataset, we continue the preprocessing by addressing missing values in two key columns: sensorlocation and latlong.

Given that the device_id field contains no missing values, it serves as a stable identifier for filling the missing values in sensorlocation and latlong. The approach is to first validate that each device has a consistent sensorlocation and latlong throughout the dataset.

In [13]:
def find_unique_counts(df, pivot_col, check_col):
    """
    Computes the number of unique values in `check_col` for each unique value in `pivot_col`.
    
    Parameters:
    - df (pd.DataFrame): Input DataFrame containing the columns to be grouped and counted.
    - pivot_col (str): Column name to group by.
    - check_col (str): Column name for which unique counts are calculated.
    """
    try:
        # Validate input types
        if not isinstance(df, pd.DataFrame):
            raise TypeError("Input `df` must be a pandas DataFrame.")

        if pivot_col not in df.columns:
            raise ValueError(f"'{pivot_col}' column not found in DataFrame.")

        if check_col not in df.columns:
            raise ValueError(f"'{check_col}' column not found in DataFrame.")

        # Perform groupby and unique count
        sensorloc_counts = df.groupby(pivot_col)[check_col].nunique().reset_index()
        count_col = f'unique_{check_col}_count'
        sensorloc_counts.columns = [pivot_col, count_col]

        print(f'\nUnique "{check_col}" count per "{pivot_col}":')
        print(sensorloc_counts)

    except Exception as e:
        print(f"An error occurred while calculating unique counts: {e}")
        raise
In [14]:
find_unique_counts(df_microclimate,'device_id','sensorlocation')
Unique "sensorlocation" count per "device_id":
             device_id  unique_sensorlocation_count
0   ICTMicroclimate-01                            1
1   ICTMicroclimate-02                            1
2   ICTMicroclimate-03                            1
3   ICTMicroclimate-04                            1
4   ICTMicroclimate-05                            1
5   ICTMicroclimate-06                            1
6   ICTMicroclimate-07                            1
7   ICTMicroclimate-08                            1
8   ICTMicroclimate-09                            1
9   ICTMicroclimate-10                            1
10  ICTMicroclimate-11                            1
11           aws5-0999                            1
In [15]:
find_unique_counts(df_microclimate,'device_id','latlong')
Unique "latlong" count per "device_id":
             device_id  unique_latlong_count
0   ICTMicroclimate-01                     1
1   ICTMicroclimate-02                     1
2   ICTMicroclimate-03                     1
3   ICTMicroclimate-04                     1
4   ICTMicroclimate-05                     1
5   ICTMicroclimate-06                     1
6   ICTMicroclimate-07                     1
7   ICTMicroclimate-08                     1
8   ICTMicroclimate-09                     1
9   ICTMicroclimate-10                     1
10  ICTMicroclimate-11                     1
11           aws5-0999                     1

The above check show that each device has a consistent sensorlocation and latlong. Hence, for each unique device_id, we determine the unique value for both sensorlocation and latlong.

Approach:

Step 1: Group by device_id and extract unique values for sensorlocation and latlong. This helps identify the expected value associated with each device.

Step 2: For each missing sensorlocation, impute the value using the known unique mapping between device_id and sensorlocation.

Step 3: Repeat a similar process for latlong, filling in the missing values using the corresponding known latlong values for each device_id. The imputation is performed independently for each of the two columns to maintain clarity and avoid propagation of errors.

This method ensures consistency by maintaining the integrity of device-specific metadata, which is critical for further analysis.

In [16]:
print(df_microclimate[['sensorlocation', 'latlong']].isna().sum())
sensorlocation     6143
latlong           11483
dtype: int64
In [17]:
def fill_missing_values_using_device_id(df, pivot_col, fill_col):
    """
    Fills missing values in `fill_col` based on unique mapping from `device_col` 
    for devices that have exactly one known value in `fill_col`.

    Parameters:
    - df (pd.DataFrame): Input DataFrame
    - device_col (str): Column containing the device identifier
    - fill_col (str): Column to be filled based on unique device mapping

    Returns:
    - pd.DataFrame: DataFrame with missing values in `fill_col` filled where possible
    """
    try:
        if not isinstance(df, pd.DataFrame):
            raise TypeError("Input `df` must be a pandas DataFrame.")
        if pivot_col not in df.columns:
            raise ValueError(f"'{pivot_col}' column not found in DataFrame.")
        if fill_col not in df.columns:
            raise ValueError(f"'{fill_col}' column not found in DataFrame.")

        # Subset rows where both columns are not missing
        valid_rows = df.dropna(subset=[pivot_col, fill_col])

        # Count unique fill_col values per device
        unique_counts = valid_rows.groupby(pivot_col)[fill_col].nunique()

        # Filter devices with only 1 unique fill_col value
        single_value_devices = unique_counts[unique_counts == 1].index

        # Create mapping from device_id to that one known value
        mapping = (
            valid_rows[valid_rows[pivot_col].isin(single_value_devices)]
            .groupby(pivot_col)[fill_col]
            .first()
        )

        # Fill missing values using map
        df[fill_col] = df[fill_col].fillna(df[pivot_col].map(mapping))

        print(f'Missing values in "{fill_col}" filled using "{pivot_col}" where uniquely possible.')

    except Exception as e:
        print(f"An error occurred while filling missing values: {e}")
        raise

    return df
In [18]:
df_microclimate = fill_missing_values_using_device_id(df_microclimate,'device_id', 'sensorlocation')
df_microclimate = fill_missing_values_using_device_id(df_microclimate,'device_id', 'latlong')
Missing values in "sensorlocation" filled using "device_id" where uniquely possible.
Missing values in "latlong" filled using "device_id" where uniquely possible.
In [19]:
print(df_microclimate[['sensorlocation', 'latlong']].isna().sum())
sensorlocation    0
latlong           0
dtype: int64

The received_at column is converted to a proper datetime object to enable time-series analysis.

In [20]:
# Convert 'received_at' column to datetime format
df_microclimate['received_at'] = pd.to_datetime(df_microclimate['received_at'], errors='coerce')
print(df_microclimate['received_at'].head())
print(df_microclimate['received_at'].dtypes)
0   2025-02-09 00:54:37+00:00
1   2025-02-09 01:02:11+00:00
2   2025-02-09 01:03:24+00:00
3   2025-02-09 01:02:43+00:00
4   2025-02-09 01:17:37+00:00
Name: received_at, dtype: datetime64[ns, UTC]
datetime64[ns, UTC]

In order to addess the missing values in the rest of the numerical columns such as airtemperature, relativehumidity, pm25, pm10, noise, the dataset is first sorted by device_id and received_at to ensure chronological order for each sensor. Missing values in numeric columns are then filled using linear interpolation, which estimates values based on the average of preceding and following readings within each device group.

In [21]:
# Sort the DataFrame by device_id and received_at for time series continuity
df_microclimate.sort_values(by=['device_id', 'received_at'], inplace=True)

# Select numeric columns
numeric_cols = df_microclimate.select_dtypes(include=[np.number]).columns

# Apply linear interpolation within each device_id group
df_microclimate[numeric_cols] = (
    df_microclimate.groupby('device_id')[numeric_cols]
    .apply(lambda group: group.interpolate(method='linear', limit_direction='both'))
    .reset_index(drop=True)
)

# Check if missing values remain
print(df_microclimate[numeric_cols].isna().sum())
airtemperature          0
relativehumidity        0
pm25                12748
pm10                12748
noise               12748
dtype: int64

To facilitate spatial analysis, the latlong column was split into separate latitude and longitude columns. These new columns were then converted into numeric formats to allow for further computations and visualizations. Finally, the original latlong column was dropped to avoid redundancy, leaving a clean and structured dataset ready for spatial analysis and modeling.

In [22]:
#splitting geo coordinates
df_microclimate = split_geo_coordinates(df_microclimate,'latlong')

print('First few rows of the dataset after preprocessing:\n',df_microclimate.head(5))
Dataset Info after Geo Split:

<class 'pandas.core.frame.DataFrame'>
Index: 292472 entries, 93241 to 292466
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype              
---  ------            --------------   -----              
 0   device_id         292472 non-null  object             
 1   received_at       292472 non-null  datetime64[ns, UTC]
 2   sensorlocation    292472 non-null  object             
 3   airtemperature    292472 non-null  float64            
 4   relativehumidity  292472 non-null  float64            
 5   pm25              279724 non-null  float64            
 6   pm10              279724 non-null  float64            
 7   noise             279724 non-null  float64            
 8   latitude          292472 non-null  float64            
 9   longitude         292472 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(7), object(2)
memory usage: 24.5+ MB
None
First few rows of the dataset after preprocessing:
                 device_id               received_at  \
93241  ICTMicroclimate-01 2024-05-29 03:57:29+00:00   
26101  ICTMicroclimate-01 2024-05-29 04:12:30+00:00   
42960  ICTMicroclimate-01 2024-05-29 04:27:30+00:00   
42972  ICTMicroclimate-01 2024-05-29 04:42:31+00:00   
93244  ICTMicroclimate-01 2024-05-29 04:57:31+00:00   

                        sensorlocation  airtemperature  relativehumidity  \
93241  Birrarung Marr Park - Pole 1131            18.5         38.800000   
26101  Birrarung Marr Park - Pole 1131            24.0         57.599998   
42960  Birrarung Marr Park - Pole 1131            21.2         61.700000   
42972  Birrarung Marr Park - Pole 1131            19.4         68.400000   
93244  Birrarung Marr Park - Pole 1131            18.9         36.700000   

       pm25  pm10      noise   latitude  longitude  
93241   2.0   3.0  66.300000 -37.818593  144.97164  
26101   4.0   9.0  58.400002 -37.818593  144.97164  
42960   8.0  11.0  69.400000 -37.818593  144.97164  
42972   4.0   6.0  69.800000 -37.818593  144.97164  
93244   2.0   3.0  62.300000 -37.818593  144.97164  

Tree Planting Zone dataset

In [23]:
df_tree_planting = check_preprocess_dataset(df_tree_planting, 'Tree Planting Zone')
Dataset Information for "Tree Planting Zone":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 839 entries, 0 to 838
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   geo_point_2d  839 non-null    object 
 1   geo_shape     839 non-null    object 
 2   str_from      0 non-null      float64
 3   segpart       0 non-null      float64
 4   statusid      0 non-null      float64
 5   segid         839 non-null    int64  
 6   streetname    0 non-null      float64
 7   streetid      0 non-null      float64
 8   schedule      839 non-null    object 
 9   mapstatus     0 non-null      float64
 10  str_to        0 non-null      float64
 11  segdescr      839 non-null    object 
dtypes: float64(7), int64(1), object(4)
memory usage: 78.8+ KB
None

Missing values in "Tree Planting Zone" dataset:

geo_point_2d      0
geo_shape         0
str_from        839
segpart         839
statusid        839
segid             0
streetname      839
streetid        839
schedule          0
mapstatus       839
str_to          839
segdescr          0
dtype: int64

No duplicate records found in "Tree Planting Zone".
In [24]:
#selecting relevant columns
df_tree_planting = df_tree_planting[['geo_point_2d', 'geo_shape', 'segid', 'schedule','segdescr']]

To facilitate spatial analysis, the geo_point_2d column was split into separate latitude and longitude columns. These new columns were then converted into numeric formats to allow for further computations and visualizations. Finally, the original geo_point_2d column was dropped to avoid redundancy, leaving a clean and structured dataset ready for spatial analysis and modeling.

In [25]:
df_tree_planting = split_geo_coordinates(df_tree_planting,'geo_point_2d')

print('First few rows of the dataset after preprocessing:\n',df_tree_planting.head(5))
Dataset Info after Geo Split:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 839 entries, 0 to 838
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   geo_shape  839 non-null    object 
 1   segid      839 non-null    int64  
 2   schedule   839 non-null    object 
 3   segdescr   839 non-null    object 
 4   latitude   839 non-null    float64
 5   longitude  839 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 39.5+ KB
None
First few rows of the dataset after preprocessing:
                                            geo_shape  segid  \
0  {"coordinates": [[[[144.969387175266, -37.8029...  21556   
1  {"coordinates": [[[[144.96828098035, -37.81969...  22067   
2  {"coordinates": [[[[144.969341164027, -37.7965...  20697   
3  {"coordinates": [[[[144.950283591209, -37.7975...  21195   
4  {"coordinates": [[[[144.986101797552, -37.8163...  21945   

                          schedule  \
0  Not determined by precinct plan   
1                     Years 8 - 10   
2                      Years 5 - 7   
3                         Complete   
4  Not determined by precinct plan   

                                            segdescr   latitude   longitude  
0  Pelham Street between Rathdowne Street and Dru... -37.803061  144.968795  
1  Riverside Avenue between St Kilda Road and Sou... -37.819792  144.967511  
2  Little Palmerston Street between Rathdowne Str... -37.796602  144.969907  
3  Chapman Street between Errol Street and Harker... -37.797745  144.950262  
4  Wellington Parade between Simpson Street and P... -37.816525  144.986495  

Data Analysis and Visualisation¶

Tree Canopies 2021 dataset¶

The below visualisation of illustrates the spatial extent of tree canopy coverage across Melbourne by transforming polygon data into geospatial geometries. The processed shapes were plotted on a static map to highlight zones with dense vegetation.

In [26]:
# convert Tree Canopies dataset into a GeoDataFrame
df_tree_canopy_2021['geometry'] = df_tree_canopy_2021['geo_shape'].apply(lambda x: shape(json.loads(x)))

# Create GeoDataFrame
gdf = gpd.GeoDataFrame(df_tree_canopy_2021, geometry='geometry', crs='EPSG:4326')

# Plot static map
fig, ax = plt.subplots(figsize=(12, 10))
gdf.plot(ax=ax, color='green', edgecolor='darkgreen', alpha=1, label='Canopies')
ax.set_title('Tree Canopy Coverage (2021) - Melbourne', fontsize=14)
ax.set_xlabel("Longitude", fontsize=10)
ax.set_ylabel("Latitude", fontsize=10)
plt.show()
No description has been provided for this image

The chart shows a dense distribution of tree canopies in specific parts of the region, highlighted in green. These visualizations are critical for identifying areas that already have substantial green coverage and pinpointing regions that require further intervention to enhance urban greenery. This analysis supports strategic planning for biodiversity conservation, reduction of urban heat islands, and improving the quality of public spaces.

In the below visualisation, segmenting the geographic area into a grid by binning both latitude and longitude values, then counting how many tree canopy points fall into each grid cell. These counts are visualised as a 2D heatmap, where darker green shades represent higher tree canopy density across Melbourne.

In [27]:
# Bin both longitude and latitude
df_tree_canopy_2021['longitude_bins'] = pd.cut(df_tree_canopy_2021['longitude'], bins=5)
df_tree_canopy_2021['latitude_bins'] = pd.cut(df_tree_canopy_2021['latitude'], bins=5)

# Group by both bins
lat_lon_counts = df_tree_canopy_2021.groupby(['latitude_bins', 'longitude_bins']).size().unstack(fill_value=0)

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(lat_lon_counts, cmap='Greens', annot=True, fmt='d', linewidths=0.5)
plt.title("Tree Canopy Distribution by Latitude and Longitude Bins", fontsize=14)
plt.xlabel("Longitude Range")
plt.ylabel("Latitude Range")
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
No description has been provided for this image

From the above heatmap, noting that the central and southeastern grid cells show the highest tree canopy density, indicating dense vegetation in those zones. In contrast, the northwest and outer edges have significantly fewer data points, suggesting lower canopy coverage or less vegetation data available in those areas. This spatial insight is valuable for targeting urban greening efforts where canopy presence is sparse.


Microclimate Sensor dataset¶

The microclimate dataset analysis focuses on two critical aspects: air temperature variation and PM2.5 pollution levels. The process begins by adding a month column to each record, which enables the grouping of data by both month and device. Next, average values for PM2.5 and air temperature are computed for each sensor across the months. This step prepares the data to reveal seasonal trends and sensor-specific behaviour. The resulting averages help in understanding environmental patterns that are crucial for urban planning and pollution management.

The box-and-whisker format helps identify both the typical range (the box) and the more extreme daily averages (the whiskers and any outliers). The taller boxes or higher medians tend to have more elevated PM2.5 & Air Temperature levels or wider day-to-day fluctuations. The spread of the boxes reveals how stable or variable temperatures are on a daily basis, and outliers may point to unusual temperature spikes or dips on specific days.

In [28]:
# Create a 'month' column
df_microclimate['month'] = df_microclimate['received_at'].dt.to_period('M').astype(str)

# Group by month and device to calculate the average PM2.5 * Air Temperature
avg_pm25 = df_microclimate.groupby(['month', 'device_id'])['pm25'].mean().reset_index()
avg_air_temp = df_microclimate.groupby(['month', 'device_id'])['airtemperature'].mean().reset_index()

# Create a box plot: x-axis as month, y-axis as average PM2.5
fig1 = px.box(avg_pm25, 
              x='month', 
              y='pm25', 
              title="Distribution of Average PM2.5 per Month Across Devices",
              labels={"pm25": "Average PM2.5", "month": "Month"})
fig1.show()

# Create a box plot: x-axis as month, y-axis as average Air Temperature
fig2 = px.box(avg_air_temp, 
              x='month', 
              y='airtemperature', 
              title="Distribution of Average Air Temperature per Month Across Devices",
              labels={"airtemperature": "Average Air Temperature", "month": "Month"})
fig2.show()

Distribution of Average PM2.5 per Month Across Devices:¶

The box plot displays the distribution of average PM2.5 levels across different months for all devices. It highlights the central tendency and variability in pollution levels, clearly showing which months record higher or lower PM2.5 concentrations. Observations from this plot reveal that the months from December to March, likely influenced by the summer season, exhibit a higher median PM2.5. Additionally, the plot shows noticeable outliers, indicating that on certain days, there are unexpected spikes in pollution levels. These variations could be due to transient local events or differences in microclimate conditions across sensor locations.

Distribution of Average Air Temperature per Month Across Devices:¶

The second box plot presents the monthly distribution of average air temperature across the sensors, offering a clear view of how temperature varies over time. Observations from this plot indicate that during the summer months, temperatures consistently register at higher levels, with a relatively tighter range. This suggests that warm conditions are more uniform across the sensor locations during summer. In contrast, the winter months show lower median temperatures accompanied by a wider spread, which implies greater variability in cooler conditions possibly due to diverse microclimate influences. Occasional outliers in both seasons could be highlighting instances of sudden temperature fluctuations or extreme weather events


Building upon the previous analysis, we narrow our focus to January—the month with the highest PM2.5 levels. The dataset is filtered to include only January's records, and a new date column is created to group data by each day. For every device, daily averages for PM2.5 and air temperature are then computed, providing a more granular view of environmental conditions. This detailed approach reveals short-term fluctuations and local effects that might be obscured in broader monthly averages, offering deeper insights for targeted urban and pollution management strategies. Additionally, the analysis aims to observe the variance across all devices to determine whether the differences in daily averages are consistent or if certain locations exhibit significantly different patterns.

In [29]:
# Filter for January data
january_data = df_microclimate[df_microclimate['received_at'].dt.month == 1].copy()

# Create a 'date' column (year-month-day) for grouping
january_data['date'] = january_data['received_at'].dt.date

# Group by device and date to calculate the daily average PM2.5
daily_avg_pm25 = january_data.groupby(['device_id', 'date'])['pm25'].mean().reset_index()
daily_avg_air_temp = january_data.groupby(['device_id', 'date'])['airtemperature'].mean().reset_index()

# Create a box plot: x-axis as device_id, y-axis as the daily average PM2.5
fig1 = px.box(daily_avg_pm25, 
             x='device_id', 
             y='pm25', 
             title="Average PM2.5 in January Across Devices",
             labels={"device_id": "Device ID", "pm25": "Daily Average PM2.5"})

fig1.show()

# Create a box plot: x-axis as device_id, y-axis as the daily average PM2.5
fig2 = px.box(daily_avg_air_temp, 
             x='device_id', 
             y='airtemperature', 
             title="Average Air Temperature in January Across Devices",
             labels={"device_id": "Device ID", "pm25": "Daily Average Air Temperature"})

fig2.show()

Average PM2.5 in January Across Devices:¶

This box plot clearly indicates that PM2.5 readings vary significantly among devices. Some devices consistently record higher median pollution levels with minimal variability, while others show wider ranges and occasional spikes. This suggests that certain locations are more prone to higher pollution levels, likely due to local factors that contribute to uneven distribution. Overall, the box plot offers detailed insights into daily environmental conditions across devices in January, highlighting the differences that are critical for targeted urban and pollution management strategies.

Average Air Temperature in January Across Devices:¶

This box plot reveals how air temperature varies across devices during January. While most devices report similar median temperatures, the extent of daily variation differs noticeably—some sensors record stable, consistent conditions, while others show larger fluctuations. This suggests that localized microclimate effects may be influencing temperature readings. Overall, the observations help determine whether the variance in daily averages is consistent across all devices.


Building upon our previous analyses, the next step delves into the relationship between PM2.5 air temperature and relative humidity by calculating their correlation. The code extracts these three variables and computes a correlation matrix, which is then visualised as a heatmap. Each cell of the heatmap is annotated with the corresponding correlation coefficient, making it easier to interpret the strength and direction of the relationship. This approach provides valuable insights into how variations in air temperature and relative humidity is linked to changes in PM2.5 levels, offering further depth to our analysis.

In [30]:
# Compute the correlation matrix for PM2.5 and Air Temperature
corr_matrix = df_microclimate[['pm25', 'airtemperature','relativehumidity' ]].corr()

# Create a heatmap using Matplotlib
fig, ax = plt.subplots(figsize=(6, 4))
cax = ax.imshow(corr_matrix, cmap='viridis', interpolation='nearest')

# Set axis ticks and labels
ax.set_xticks(np.arange(len(corr_matrix.columns)))
ax.set_yticks(np.arange(len(corr_matrix.index)))
ax.set_xticklabels(corr_matrix.columns)
ax.set_yticklabels(corr_matrix.index)

# Rotate x-axis labels for better readability
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Annotate each cell in the heatmap with the correlation value
for i in range(len(corr_matrix.index)):
    for j in range(len(corr_matrix.columns)):
        ax.text(j, i, f"{corr_matrix.iloc[i, j]:.2f}", ha="center", va="center", color="w")

# Add title and colour bar
ax.set_title("Correlation Heatmap for PM2.5 and Air Temperature")
fig.colorbar(cax, ax=ax)

plt.tight_layout()
plt.show()
No description has been provided for this image

The above correlation heatmap shows that PM2.5 levels have a moderate negative relationship with both air temperature and relative humidity. Specifically, as air temperature increases, PM2.5 levels tend to decrease, suggesting that warmer conditions may facilitate the dispersion of airborne pollutants. Similarly, higher relative humidity is associated with lower PM2.5 levels, indicating that increased moisture in the air might help in settling fine particles. Additionally, there is a relatively stronger negative correlation between air temperature and relative humidity, which is expected because warmer air can hold more moisture, thereby reducing the relative humidity percentage. Overall, the data suggests that both higher temperatures and increased humidity may contribute to lower concentrations of PM2.5 in the air.

Tree Planting Zone dataset¶

The following analysis examines tree planting data through a visual summary that highlights the distribution of segment or zone counts across different project schedules. The data is grouped by project timeframes to provide clear insights into how many segments or zones are at each stage. Custom ordering of the schedule column enhances readability and ensures the visualisation delivers more meaningful insights.

In [31]:
# Group by schedule and count the number of segid
grouped = df_tree_planting.groupby('schedule')['segid'].count().reset_index()
# define the custom order to improve the readability
custom_order = ['Complete', 'Years 1 - 4', 'Years 5 - 7', 'Years 8 - 10', 'Not determined by precinct plan']
# Convert the 'schedule' column to a categorical type with the specified order
grouped['schedule'] = pd.Categorical(grouped['schedule'], categories=custom_order, ordered=True)

# Sort the grouped data by the custom order
grouped.sort_values('schedule', inplace=True)

# Create the bar plot
plt.figure(figsize=(10, 6))
plt.bar(grouped['schedule'], grouped['segid'])
plt.xlabel('Schedule')
plt.ylabel('Number of Segments/Zones')
plt.title('Segments by Timeframe')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()
No description has been provided for this image

From the chart, we can observe that the tree planting initiatives are at different stages across the zones. Approximately one-quarter of the zones have already completed their tree planting, indicating significant progress in these areas. The dataset reveals that the most intensive planting activity is planned for Years 5-7, which is the peak period for upcoming projects. This is followed by a substantial number of projects scheduled for Years 8-10, reflecting medium-to-long term plans. In contrast, the early-stage schedule (Years 1-4) shows the least activity, suggesting that immediate planting efforts are relatively fewer. Additionally, less than a quarter of the zones have an undetermined timeframe for planting, pointing to some uncertainty or pending decisions in these areas.


The below visualisation combines spatial data—the locations of planting zones—with project schedule details to offer a comprehensive, bird’s-eye view of tree planting across Melbourne. It clearly shows how different zones are progressing, whether they are complete or planned for future periods.

The process begins by verifying that essential location details (longitude and latitude) are available, as these are crucial for accurately plotting the zones on a map. Once confirmed, the data is transformed to align with Melbourne’s mapping system, ensuring each planting zone appears in its correct location. The zones are then colour-coded based on their project schedule, enabling quick visual differentiation of their progress. Finally, a background map of Melbourne is added, providing familiar context and making it easier to understand the spatial distribution and planning of the tree planting projects.

In [32]:
# Replace 'longitude' and 'latitude' with the actual column names if different
if 'longitude' in df_tree_planting.columns and 'latitude' in df_tree_planting.columns:
    gdf = gpd.GeoDataFrame(
        df_tree_planting, geometry=gpd.points_from_xy(df_tree_planting.longitude, df_tree_planting.latitude),
        crs="EPSG:4326"
    )
else:
    raise ValueError("The dataset must contain 'longitude' and 'latitude' columns.")

# Convert GeoDataFrame to Web Mercator (EPSG:3857) for compatibility with basemaps
gdf = gdf.to_crs(epsg=3857)

# Plotting the GeoDataFrame, colour coding by 'schedule'
fig, ax = plt.subplots(figsize=(12, 8))
gdf.plot(ax=ax, column='schedule', categorical=True, legend=True, markersize=25, legend_kwds={'loc': 'lower left'})

# Add a basemap (using Stamen Toner Lite tiles)
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)

# Set title and remove axis for a cleaner look
ax.set_title('Planting Zones on Melbourne Map by Schedule')
ax.set_axis_off()

plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]: